The Boston data set is included in the MASS R package as a demonstration (example) data set.
Describe the dataset briefly.
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
[1] 506 14
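The structure and dimension output above can be reproduced with a few lines of R. This is a minimal sketch, assuming the data are loaded from the `MASS` package.

```r
# Load the Boston housing data (assumed here to come from the MASS package)
library(MASS)
data("Boston")

# Structure: variable names, types and the first few values of each column
str(Boston)

# Dimensions: number of observations (rows) and variables (columns)
dim(Boston)
```

All 14 columns are numeric or integer, which is why they can later be standardized as a whole.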
Show summaries of the variables in the data.
crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
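A table like the one above is what `summary()` prints for a data frame. The sketch below also compares the mean and median of `crim`, since in the table the mean (3.61) lies far above the median (0.26), which points to a strongly right-skewed distribution. It again assumes the data come from `MASS`.

```r
library(MASS)
data("Boston")

# Six-number summary (min, quartiles, mean, max) for every variable
summary(Boston)

# A mean well above the median indicates a long right tail
crim_center <- c(mean = mean(Boston$crim), median = median(Boston$crim))
crim_center
```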
Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
##### Correlations between variables
Show a graphical overview of the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
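One possible way to get both a numeric and a graphical overview with base R only is sketched below. The choice of columns passed to `pairs()` is an illustrative assumption, not the selection used in the original report.

```r
library(MASS)
data("Boston")

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(Boston), 2)
cor_matrix

# Variables ordered by the absolute strength of their correlation with crim
crim_cor <- cor_matrix["crim", colnames(cor_matrix) != "crim"]
sort(abs(crim_cor), decreasing = TRUE)

# Scatterplot matrix for a handful of variables (illustrative subset)
pairs(Boston[, c("crim", "rad", "tax", "lstat", "medv")])
```

Dedicated packages can draw the correlation matrix itself as a plot, but the base R calls above are enough to read off which pairs of variables move together.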
After standardization the mean of every variable is zero, so the values of each variable are distributed around zero. This can be seen in the summary table of the scaled data, where every Mean entry is 0.
Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset into train and test sets, so that 80% of the data belongs to the train set. (0-2 points)
crim zn indus
Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
Median :-0.390280 Median :-0.48724 Median :-0.2109
Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
chas nox rm age
Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
dis rad tax ptratio
Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
black lstat medv
Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
Median : 0.3808 Median :-0.1811 Median :-0.1449
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
[1] "matrix"
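The scaled summary and the class output above correspond to a call to `scale()`, which subtracts each column's mean and divides by its standard deviation. A minimal sketch, assuming the data come from `MASS` (the object name `boston_scaled` is chosen here for illustration):

```r
library(MASS)
data("Boston")

# Standardize every column: (x - mean(x)) / sd(x)
boston_scaled <- scale(Boston)

# Every column now has mean 0
summary(boston_scaled)

# scale() returns a matrix, not a data frame
is.matrix(boston_scaled)

# Convert back to a data frame so columns can be added and removed by name
boston_scaled <- as.data.frame(boston_scaled)
```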
crime
low med_low med_high high
127 126 126 127
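The class counts above are consistent with cutting the scaled crime rate at its quartiles. The sketch below shows one way to do that and to make the 80/20 split. The seed is an arbitrary assumption, so the rows that end up in each set will differ from the original run.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Quartiles of the scaled crime rate serve as break points
bins <- quantile(boston_scaled$crim)

# include.lowest = TRUE keeps the minimum value inside the first class
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)

# Replace the numeric crime rate with the categorical one
boston_scaled$crim <- NULL
boston_scaled$crime <- crime

# Random 80/20 split (seed chosen arbitrarily for reproducibility)
set.seed(123)
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(0.8 * n))
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
```

With 506 observations this gives 404 training rows and 102 test rows.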
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
Call:
lda(crime ~ ., data = train)
Prior probabilities of groups:
low med_low med_high high
0.2549505 0.2524752 0.2500000 0.2425743
Group means:
zn indus chas nox rm
low 0.96766847 -0.8903391 -0.157656245 -0.8725799 0.46150143
med_low -0.05342445 -0.3145688 -0.002135914 -0.5905480 -0.05494454
med_high -0.38450487 0.2172987 0.312388789 0.4471085 0.13546139
high -0.48724019 1.0171960 -0.031282114 1.0655927 -0.39780090
age dis rad tax ptratio
low -0.8785866 0.8666066 -0.6863918 -0.7687758 -0.38665499
med_low -0.4523360 0.3986506 -0.5461293 -0.4832932 -0.08497441
med_high 0.4838966 -0.4194526 -0.4292425 -0.3149967 -0.33023442
high 0.8216012 -0.8542563 1.6373367 1.5134896 0.77985517
black lstat medv
low 0.3857239 -0.74727885 0.53649322
med_low 0.3216575 -0.23909288 0.04536142
med_high 0.0728271 0.04156773 0.20183555
high -0.8241154 0.91310069 -0.72639881
Coefficients of linear discriminants:
LD1 LD2 LD3
zn 0.07259113 0.63535463 -1.10554588
indus 0.07877088 -0.14843165 0.25061387
chas -0.08036377 -0.11387612 0.10468902
nox 0.29373420 -0.76676454 -1.21140780
rm -0.05252219 -0.11606378 -0.10042880
age 0.30240880 -0.38963008 -0.30941564
dis -0.01626732 -0.26942519 0.12853569
rad 3.36950641 0.99249418 -0.29354639
tax 0.06529043 -0.10816106 0.81682405
ptratio 0.10232045 0.04419032 -0.40394980
black -0.11478392 0.03045567 0.08291172
lstat 0.30004076 -0.25474792 0.22370173
medv 0.23275732 -0.41810670 -0.27921866
Proportion of trace:
LD1 LD2 LD3
0.9496 0.0384 0.0120
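The model output above comes from `lda(crime ~ ., data = train)`. The sketch below refits such a model and draws a biplot. The seed is an arbitrary assumption, so the coefficients will not match the printed ones exactly, and `draw_lda_arrows()` is a hypothetical helper written for this illustration.

```r
library(MASS)
data("Boston")

# Rebuild the training set: scale, categorize crime by quartiles, split 80/20
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)  # arbitrary seed, results differ from the printed output
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]

# Fit LDA with crime as target and all other variables as predictors
lda.fit <- lda(crime ~ ., data = train)
lda.fit

# Hypothetical helper: draw one arrow per predictor using its LD1/LD2 coefficients
draw_lda_arrows <- function(fit, scale_factor = 2) {
  heads <- fit$scaling[, 1:2] * scale_factor
  arrows(0, 0, heads[, 1], heads[, 2], length = 0.1, col = "red")
  text(heads, labels = rownames(heads), pos = 3, cex = 0.8, col = "red")
}

# Observations in the LD1/LD2 plane, colored and marked by crime class
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
draw_lda_arrows(lda.fit)
```

In the printed coefficients `rad` has by far the largest LD1 weight (3.37), and LD1 accounts for about 95% of the between-group variance, so the longest arrow in such a biplot is expected to belong to `rad`.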
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.
predicted
correct low med_low med_high high Sum
low 13 11 0 0 24
med_low 0 16 8 0 24
med_high 0 8 15 2 25
high 0 0 0 29 29
Sum 13 35 23 31 102
[1] 0.2843137
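The last number printed above equals the share of misclassified test observations in the cross table (29 of 102). The sketch below first checks that arithmetic on the printed table and then shows how such a table can be produced. The seed is an arbitrary assumption, so the refitted table will not be identical to the printed one.

```r
library(MASS)
data("Boston")

# Cross table copied from the output above (rows: correct, columns: predicted)
lev <- c("low", "med_low", "med_high", "high")
printed <- matrix(c(13, 11,  0,  0,
                     0, 16,  8,  0,
                     0,  8, 15,  2,
                     0,  0,  0, 29),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(correct = lev, predicted = lev))

# Misclassification rate = 1 - (correct predictions on the diagonal / total)
printed_error <- 1 - sum(diag(printed)) / sum(printed)
printed_error

# Producing such a table: prepare data, split, fit, predict
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE, labels = lev)
boston_scaled$crim <- NULL
set.seed(123)  # arbitrary seed
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]

# Save the true classes, then remove them from the test set
correct_classes <- test$crime
test$crime <- NULL

lda.fit <- lda(crime ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulate true against predicted classes, with row and column sums
tab <- table(correct = correct_classes, predicted = lda.pred$class)
addmargins(tab)
```

In the printed table every `high` observation is classified correctly, while most errors are confusions between neighboring classes such as `low` and `med_low`.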
Calculate the distances between the observations. Run the k-means algorithm on the dataset. Investigate what the optimal number of clusters is and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
[1] "matrix"
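The steps of this task can be sketched as follows, under stated assumptions: the seed, the maximum of 10 clusters examined, the final choice of 2 clusters and the variables shown in `pairs()` are all illustrative placeholders rather than results taken from the original run.

```r
library(MASS)
data("Boston")

# Standardize first so that no variable dominates the distances
boston_scaled <- as.data.frame(scale(Boston))

# Euclidean distances between all pairs of observations
dist_eu <- dist(boston_scaled)
summary(dist_eu)

# Total within-cluster sum of squares for k = 1, ..., k_max
set.seed(123)  # k-means starts from random centers
k_max <- 10
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled, centers = k)$tot.withinss
})

# A sharp bend ("elbow") in this curve suggests a reasonable number of clusters
plot(1:k_max, twcss, type = "b",
     xlab = "number of clusters", ylab = "total within-cluster SS")

# Rerun with the chosen k (2 is a placeholder, read the value off the plot)
km <- kmeans(boston_scaled, centers = 2)

# Scatterplot matrix of a few variables, colored by cluster
pairs(boston_scaled[, c("crim", "rad", "tax", "lstat", "medv")],
      col = km$cluster)
```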
Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters?
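A rough sketch of the bonus task is given below. The number of clusters (3), the seed and `nstart` are assumptions. The filter on within-cluster variation is a safeguard added for this illustration: `lda()` stops with an error if a predictor is constant within every group, so such a predictor, if any, is left out instead of the model failing.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# k-means on the standardized data (3 clusters is an assumption)
set.seed(123)
km <- kmeans(boston_scaled, centers = 3, nstart = 20)
boston_scaled$cluster <- factor(km$cluster)

# Safeguard: keep only predictors that vary within at least one cluster
predictors <- setdiff(names(boston_scaled), "cluster")
within_sd <- sapply(predictors, function(v) {
  sd(boston_scaled[[v]] - ave(boston_scaled[[v]], boston_scaled$cluster))
})
keep <- predictors[within_sd > 1e-4]

# LDA with the clusters as target classes
lda.km <- lda(reformulate(keep, response = "cluster"), data = boston_scaled)

# Variables ranked by the absolute size of their LD1 coefficient
sort(abs(lda.km$scaling[, "LD1"]), decreasing = TRUE)

# Biplot with arrows for the original variables
cl <- as.numeric(boston_scaled$cluster)
plot(lda.km, dimen = 2, col = cl, pch = cl)
heads <- lda.km$scaling[, 1:2] * 2
arrows(0, 0, heads[, 1], heads[, 2], length = 0.1, col = "red")
text(heads, labels = rownames(heads), pos = 3, cex = 0.8, col = "red")
```

The variables at the top of the sorted list, which also have the longest arrows, are the ones that contribute most to separating the clusters along LD1.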
Super-Bonus:
Adjust the code: add color as an argument to the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?
[1] 404 13
[1] 13 3
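The two dimension outputs above are consistent with a 404 x 13 matrix of training predictors and the 13 x 3 matrix of LDA coefficients, whose product gives the coordinates of each training observation on LD1 to LD3. The sketch below rebuilds these objects and prepares both 3D plots. The seed, the number of k-means clusters (4) and the object names are assumptions, and the plotting part only runs if the `plotly` package is installed.

```r
library(MASS)
data("Boston")

# Rebuild the training set and the LDA model (arbitrary seed)
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Predictors only, in the same column order as the rows of lda.fit$scaling
model_predictors <- train[, names(train) != "crime"]
dim(model_predictors)
dim(lda.fit$scaling)

# Project the observations onto the three linear discriminants
matrix_product <- as.data.frame(as.matrix(model_predictors) %*% lda.fit$scaling)

# k-means on the same predictors (4 clusters is an assumption)
km <- kmeans(model_predictors, centers = 4)

if (requireNamespace("plotly", quietly = TRUE)) {
  # 3D scatter colored by the crime classes of the train set
  p_crime <- plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                             z = matrix_product$LD3, type = "scatter3d",
                             mode = "markers", color = train$crime)
  # Same points colored by k-means cluster
  p_cluster <- plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                               z = matrix_product$LD3, type = "scatter3d",
                               mode = "markers", color = factor(km$cluster))
  # Print p_crime or p_cluster in an interactive session to display the plots
}
```

Because both plots use the same coordinates, only the coloring differs, which makes it easy to see where the k-means clusters agree with the crime classes.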